Accelerating Distributed Learning: all about
RDMA, ps-lite, Distributed Training, Parameter Server and Ring All-reduce
- presenter
Jingrong Chen
- time:
What I learned
2. IRN (Revisiting Network Support for RDMA)
3. RoCEv2 + PFC -> DCQCN
5. Distributed Training
- Data parallelism (the data is split across workers)
- Model parallelism (the model is split across workers)
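To make the data-distributed case concrete, here is a minimal sketch in plain Python. It assumes a toy 1-D model y = w * x with a mean squared error loss; the function names `grad` and `data_parallel_grad` are made up for illustration. Every worker keeps a full copy of the model, computes a gradient on its own equal-sized slice of the batch, and the slice gradients are averaged.

```python
def grad(w, xs, ys):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, num_workers):
    """Average of per-worker gradients, each computed on an equal slice of the batch."""
    assert len(xs) % num_workers == 0, "sketch assumes equal-sized shards"
    shard = len(xs) // num_workers
    local_grads = [
        grad(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    return sum(local_grads) / num_workers
```

With equal-sized shards the averaged gradient matches the gradient over the whole batch, which is why the workers only need to exchange gradients and not data.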
6. Communication Structure
- Parameter Server (TensorFlow)
- Ring All-reduce
7. Parameter Server
Nature: a key-value store (KVStore)
- Asynchronous update
- Makes fault tolerance easy
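A rough sketch of the KVStore idea, using threads in one process in place of real workers and a network. The class `ParameterServer`, its `init`/`push`/`pull` methods, and the helper `run_workers` are hypothetical names for illustration, not the API of any particular framework. A push is applied as soon as it arrives, so no worker waits for the others (asynchronous update).

```python
import threading


class ParameterServer:
    """Toy key-value store: key -> list of floats, updated by plain SGD."""

    def __init__(self, lr):
        self._params = {}
        self._lock = threading.Lock()
        self._lr = lr

    def init(self, key, value):
        with self._lock:
            self._params[key] = list(value)

    def push(self, key, grad):
        """Apply one worker's gradient immediately, without waiting for other workers."""
        with self._lock:
            weights = self._params[key]
            for i, g in enumerate(grad):
                weights[i] -= self._lr * g

    def pull(self, key):
        """Return a copy of the current value, which may already include other workers' pushes."""
        with self._lock:
            return list(self._params[key])


def run_workers(num_workers, pushes_per_worker, grad):
    """Start worker threads that each pull, then push a fixed gradient (illustrative only)."""
    server = ParameterServer(lr=0.5)
    server.init("w", [0.0, 0.0])

    def worker():
        for _ in range(pushes_per_worker):
            server.pull("w")  # a real worker would compute its gradient from this value
            server.push("w", grad)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server.pull("w")
```

Because all state lives on the server under a key, a worker holds nothing that cannot be pulled again.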
8. Ring All-reduce
- No fault tolerance
- Not suitable for cloud
Implementations:
- TensorFlow + Uber Horovod
- Baidu ring-allreduce (not available)
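For reference, the ring all-reduce pattern itself can be simulated in a single process. This is a sketch of the general algorithm, not the Horovod or Baidu code; the function name `ring_allreduce` is made up here, and it assumes the vector length divides evenly by the number of workers. Each worker only ever sends to its right-hand neighbour: first a scatter-reduce phase, then an all-gather phase, each taking n - 1 steps.

```python
def ring_allreduce(vectors):
    """Simulate a sum all-reduce over equal-length vectors, one vector per worker."""
    n = len(vectors)
    length = len(vectors[0])
    assert length % n == 0, "sketch assumes the length divides evenly into n chunks"
    size = length // n
    # data[i][c] is worker i's copy of chunk c.
    data = [[list(v[c * size:(c + 1) * size]) for c in range(n)] for v in vectors]

    # Phase 1, scatter-reduce: in each step worker i sends chunk (i - step) % n
    # to worker (i + 1) % n, which adds it to its own copy.
    # Afterwards worker i holds the complete sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing chunks first so the sends of one step are simultaneous.
        sends = [(i, (i - step) % n, list(data[i][(i - step) % n])) for i in range(n)]
        for src, chunk, payload in sends:
            dst = (src + 1) % n
            data[dst][chunk] = [a + b for a, b in zip(data[dst][chunk], payload)]

    # Phase 2, all-gather: in each step worker i forwards chunk (i + 1 - step) % n,
    # which is already fully reduced, and the receiver overwrites its copy.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(data[i][(i + 1 - step) % n])) for i in range(n)]
        for src, chunk, payload in sends:
            dst = (src + 1) % n
            data[dst][chunk] = payload

    # Flatten each worker's chunks back into a single vector.
    return [[x for chunk in worker for x in chunk] for worker in data]
```

Every worker takes part in every step, so one missing worker stalls the whole ring, which matches the "no fault tolerance" point above.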
9. ps-lite
MXNet and ps-lite are decoupled, which means:
- No memory management in ps-lite
- No assumption about tensor size -> needs rendezvous mode
- 1 vs. N communication
10. Programming on Verbs
- Memory must be registered before use -> manage memory manually
- The work completion handler must not block the CQ polling thread -> thread pool / coroutine
- The number of outstanding send requests (SRs) cannot exceed the send queue (SQ) size, nor the number of outstanding receive requests (RRs) posted on the remote side -> flow control
- Small and large messages -> eager mode and rendezvous mode
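The last two points can be modelled without touching the Verbs API at all. The sketch below is plain Python bookkeeping, not libibverbs code. All names (`EAGER_THRESHOLD`, `choose_protocol`, `CreditedSender`) and the 4096-byte cutoff are assumptions for illustration. A message is posted only while there is a free SQ slot and the remote side is known to have a receive posted for it (a "credit"); otherwise it waits in a local queue.

```python
from collections import deque

EAGER_THRESHOLD = 4096  # bytes; hypothetical cutoff, a real system would tune this


def choose_protocol(nbytes, threshold=EAGER_THRESHOLD):
    """Small messages go eagerly; large ones first negotiate a buffer (rendezvous)."""
    return "eager" if nbytes <= threshold else "rendezvous"


class CreditedSender:
    """Caps in-flight sends by local SQ depth and by receives posted on the remote side."""

    def __init__(self, sq_depth, remote_credits):
        self.sq_free = sq_depth        # free slots in the local send queue
        self.credits = remote_credits  # receives the peer is known to have posted
        self.pending = deque()         # messages waiting for a slot and a credit
        self.posted = []               # messages handed to the (simulated) NIC, in order

    def send(self, msg):
        self.pending.append(msg)
        self._drain()

    def on_send_completion(self):
        """Call when a send completion is polled from the CQ: one SQ slot is free again."""
        self.sq_free += 1
        self._drain()

    def on_credit(self, n=1):
        """Call when the peer reports that it has posted n more receives."""
        self.credits += n
        self._drain()

    def _drain(self):
        # Post queued messages while both a send slot and a remote receive are available.
        while self.pending and self.sq_free > 0 and self.credits > 0:
            self.sq_free -= 1
            self.credits -= 1
            self.posted.append(self.pending.popleft())
```

In a real Verbs program the `posted.append` line is where the send work request would be posted, and `on_send_completion` would be driven by the CQ polling thread.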